This document illustrates the basic principles of performing RNA velocity analysis in R using the velociraptor package, which provides a wrapper around the scVelo python package. The document also contains exercises, indicated by Exercise.
The following resources provide additional details about the various steps covered in the document:
We will use an example data set of pancreatic development obtained
from Bastidas-Ponce et al. (2019). The
RNAVelocityData.html report provides all the details on how to
generate the SingleCellExperiment object that will be used
for the analysis. If a smaller data set is desired, for quicker
execution time, the pancreas data set can be exchanged for the
spermatogenesis data set from Hermann et al.
(2018), also generated in RNAVelocityData.html.
suppressPackageStartupMessages({
library(SingleCellExperiment)
library(scater)
})
sce <- readRDS("pancreas.rds")
sce
#> class: SingleCellExperiment
#> dim: 27998 3696
#> metadata(6): clusters_coarse_colors clusters_colors ... pca HVGs
#> assays(5): X spliced unspliced counts logcounts
#> rownames(27998): Xkr4 Gm37381 ... Gm20837 Erdr1
#> rowData names(1): highly_variable_genes
#> colnames(3696): AAACCTGAGAGGGATA AAACCTGAGCCTTGAT ... TTTGTCATCGAATGCT
#> TTTGTCATCTGTTTGT
#> colData names(5): clusters_coarse clusters S_score G2M_score sizeFactor
#> reducedDimNames(2): X_pca X_umap
#> mainExpName: NULL
#> altExpNames(0):
scater::plotReducedDim(sce, dimred = "X_umap", colour_by = "clusters")Exercise: Confirm that the example data set contains both spliced and unspliced counts. What are the corresponding assay names? What fraction of the total UMI count is assigned to each of the types?
assayNames(sce)
#> [1] "X" "spliced" "unspliced" "counts" "logcounts"
total <- sum(assay(sce, "spliced")) + sum(assay(sce, "unspliced"))
(frac_spliced <- sum(assay(sce, "spliced"))/total)
#> [1] 0.8370933
(frac_unspliced <- sum(assay(sce, "unspliced"))/total)
#> [1] 0.1629067Next, we will perform the RNA velocity estimation. In velociraptor,
this is done using the scvelo() function, which provides a
wrapper around the main functionality of scVelo. Take a moment
to study the documentation of scvelo() - you will see for
example that all the modes of scVelo (steady-state, stochastic,
dynamical) can be used. In addition, the argument
use.theirs can be set to TRUE to perform gene
filtering and normalization using scVelo. If set to
FALSE (default), these steps will be performed in R, which
can be useful in order to increase consistency with the rest of an
R-based pipeline. For a confirmation of the ability to reproduce
scVelo results with velociraptor,
see the CompareVelociraptorScvelo.html report.
Exercise: Run the RNA velocity analysis using the
velociraptor::scvelo() function. Use the dynamical mode,
and provide only the top 1000 of the estimated highly variable genes
(stored in the metadata of the example data). The number and precise
selection of genes to use for the velocity estimation can have a
considerable impact on the results (we will come back to this later),
and the “right” selection will depend on the complexity and
characteristics of the data set at hand. Also, use the 50 nearest
neighbors to calculate moments. This function takes a few minutes to
complete, and you can follow the different steps in the messages printed
by the function. In the meanwhile, think about what you expect as the
output from the function (e.g., object type, dimensions).
suppressPackageStartupMessages({
library(velociraptor)
})
velo.out <- velociraptor::scvelo(
x = sce,
subset.row = metadata(sce)$HVGs[1:1000],
mode = "dynamical",
scvelo.params = list(moments = list(n_neighbors = 50L))
)Exercise:
velo.out?velo.out
#> class: SingleCellExperiment
#> dim: 1000 3696
#> metadata(5): neighbors recover_dynamics velocity_params velocity_graph
#> velocity_graph_neg
#> assays(10): X spliced ... velocity velocity_u
#> rownames(1000): Pyy Iapp ... Rrbp1 Emc10
#> rowData names(18): fit_r2 fit_alpha ... velocity_genes varm
#> colnames(3696): AAACCTGAGAGGGATA AAACCTGAGCCTTGAT ... TTTGTCATCGAATGCT
#> TTTGTCATCTGTTTGT
#> colData names(8): velocity_self_transition root_cells ...
#> velocity_confidence velocity_confidence_transition
#> reducedDimNames(1): X_pca
#> mainExpName: NULL
#> altExpNames(0):Below we will go through the different steps covered by
scvelo() in more detail, and see where the output is stored
in the returned object.
The first step performed by scvelo() is to estimate
moments (mean and uncentered variance) of the abundances for each cell.
The moments are calculated across a small set of neighboring cells (in
the PCA space), for increased stability (think of it as a kind of
smoothing, or imputation). Two layers, Ms and
Mu, which are the first order moments (means) for spliced
and unspliced abundances, are added to the data object.
## nearest neighbors (note that the nearest neighbor to each cell is the
## cell itself)
head(metadata(velo.out)$neighbors$indices)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,] 0 638 1441 889 1099 3051 1909 3332 37 3171 216 909 1821 2497
#> [2,] 1 3300 2952 1555 1934 1896 974 856 2802 1814 368 2341 2853 2772
#> [3,] 2 1215 1830 2755 1254 1573 1689 604 3497 3065 1172 1778 842 3500
#> [4,] 3 1257 222 153 2654 3642 2095 3429 905 2042 1508 3297 1273 2398
#> [5,] 4 2737 1091 2140 1160 3549 481 3059 2580 978 2322 315 503 118
#> [6,] 5 1132 1499 827 1145 2916 505 826 2178 2091 965 3448 3663 2725
#> [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
#> [1,] 1778 3254 1166 3539 2127 1052 1438 3477 3268 33 83 716
#> [2,] 1807 3372 2233 3567 873 2614 2988 950 2374 1945 1140 2134
#> [3,] 2996 205 1802 1644 2989 478 1554 3247 2480 3627 1775 3624
#> [4,] 2993 2049 901 894 1340 1681 128 1969 144 3466 982 820
#> [5,] 1811 402 3106 1183 195 892 214 2508 297 866 2081 235
#> [6,] 1871 991 328 368 2169 3416 3300 1497 1569 1051 3306 1594
#> [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
#> [1,] 1645 603 3153 204 698 3374 2776 2023 1558 1580 1769 2733
#> [2,] 342 1667 1054 1460 3236 1132 3002 1754 1402 752 967 2636
#> [3,] 3618 1248 2258 663 3481 1747 817 514 277 2130 3268 323
#> [4,] 563 2778 2183 701 928 1732 1180 3511 1328 2110 194 1851
#> [5,] 1307 925 681 2880 1952 504 1588 1269 2893 2746 609 2639
#> [6,] 933 122 1651 816 2291 1750 3438 3022 1196 2381 2892 3032
#> [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50]
#> [1,] 2277 3076 1514 140 3619 2064 2083 695 3623 1479 2438 2885
#> [2,] 281 2149 836 2763 3041 1581 3468 488 2070 1766 2091 141
#> [3,] 2017 2406 246 3623 133 3476 2175 2385 2159 733 1068 2426
#> [4,] 1293 997 3605 272 1823 1016 3423 747 2818 471 301 2190
#> [5,] 2024 3233 3206 3193 3367 722 2135 2713 1904 331 3428 3406
#> [6,] 2658 2283 1896 3630 467 836 1115 3256 2763 2233 3567 1531
## Note Ms and Mu assays
assayNames(velo.out)
#> [1] "X" "spliced" "unspliced" "Ms" "Mu"
#> [6] "fit_t" "fit_tau" "fit_tau_" "velocity" "velocity_u"Exercise: Investigate how the number of neighbors
impacts the phase plot (the scatterplot of spliced vs unspliced
normalized abundances). While there are ways of running individual
functions from scVelo directly in the R session (e.g., as you
have seen earlier today using reticulate), the easiest way
here may be to run velociraptor::scvelo() repeatedly, with
different number of neighbors in the moment calculations. Since we are
not interested in the velocity estimates at this point, you can use the
steady-state mode, which is much faster than the dynamical mode. For
each number of neighbors, generate the phase plot for specific genes
(e.g., Sulf2 and Xist) using the
velociraptor::plotVelocity() function.
out5 <- velociraptor::scvelo(
x = sce,
subset.row = metadata(sce)$HVGs[1:1300],
mode = "steady_state",
scvelo.params = list(moments = list(n_neighbors = 5L))
)
out5$clusters <- sce$clusters
plotVelocity(out5, genes = c("Sulf2", "Xist"), genes.per.row = 2,
color_by = "clusters", which.plots = "phase")
out30 <- velociraptor::scvelo(
x = sce,
subset.row = metadata(sce)$HVGs[1:1300],
mode = "steady_state",
scvelo.params = list(moments = list(n_neighbors = 30L))
)
out30$clusters <- sce$clusters
plotVelocity(out30, genes = c("Sulf2", "Xist"), genes.per.row = 2,
color_by = "clusters", which.plots = "phase")
out100 <- velociraptor::scvelo(
x = sce,
subset.row = metadata(sce)$HVGs[1:1300],
mode = "steady_state",
scvelo.params = list(moments = list(n_neighbors = 100L))
)
out100$clusters <- sce$clusters
plotVelocity(out100, genes = c("Sulf2", "Xist"), genes.per.row = 2,
color_by = "clusters", which.plots = "phase")
out500 <- velociraptor::scvelo(
x = sce,
subset.row = metadata(sce)$HVGs[1:1300],
mode = "steady_state",
scvelo.params = list(moments = list(n_neighbors = 500L))
)
out500$clusters <- sce$clusters
plotVelocity(out500, genes = c("Sulf2", "Xist"), genes.per.row = 2,
color_by = "clusters", which.plots = "phase")Once the data is preprocessed, the next step fits the model and infers transcription rates, splicing rates and degradation rates for each gene, as well as cell-specific latent times and transcriptional states. scVelo uses an EM algorithm for the estimation, iterating between the E-step (where each observation is assigned a time and a state (induction, repression, or steady state)) and the M-step (where the rate parameters are estimated), as outlined in the figure below from Bergen et al. (2020).
This step is only required for the dynamical model. It adds several
columns to rowData(sce) (see https://scvelo.readthedocs.io/DynamicalModeling.html),
including:
fit_r2). Note that
this can be negative, if the obtained fit is worse than just using a
straight line at the mean. This is used to determine which genes are
used for the downstream analysis and projection of velocities into a
low-dimensional space.fit_alpha)fit_beta)fit_gamma)fit_t_)fit_likelihood),
averaged across all cells. The likelihood value for a gene and a cell
indicates how well the cell is described by the learned phase
trajectory.head(rowData(velo.out)[rowData(velo.out)$velocity_genes, ])
#> DataFrame with 6 rows and 18 columns
#> fit_r2 fit_alpha fit_beta fit_gamma fit_t_ fit_scaling fit_std_u
#> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> Xist 0.0816643 2.31238 26.163111 0.167319 11.70007 0.0194907 0.0314970
#> Cpe 0.7659820 3.28512 1.993591 0.189969 15.16518 0.1977422 0.7048391
#> Isl1 0.9152534 39.56635 44.674312 1.824373 12.12729 0.0162127 0.2135069
#> Nnat 0.9324625 28.09730 2.580149 0.357863 19.76561 0.0722484 2.2767632
#> Clu 0.4300578 8.67251 34.744106 0.374148 6.60626 0.0174170 0.0828731
#> Ins2 0.9949970 36.26005 0.130333 0.268981 15.37212 1.4386132 47.8505974
#> fit_std_s fit_likelihood fit_u0 fit_s0 fit_pval_steady fit_steady_u
#> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> Xist 2.79972 0.188480 0 0 0.495581 0.113795
#> Cpe 5.88132 0.601454 0 0 0.487579 1.561235
#> Isl1 6.36506 0.422870 0 0 0.498028 0.635709
#> Nnat 14.18085 0.360123 0 0 0.499801 11.243400
#> Clu 7.24431 0.366780 0 0 0.495890 0.200323
#> Ins2 17.37919 0.304236 0 0 0.499925 312.344015
#> fit_steady_s fit_variance fit_alignment_scaling velocity_genes
#> <numeric> <numeric> <numeric> <logical>
#> Xist 8.23593 1.423112 3.057727 TRUE
#> Cpe 11.83513 0.149404 4.185511 TRUE
#> Isl1 15.82331 0.299459 0.667317 TRUE
#> Nnat 46.33812 0.142184 2.414013 TRUE
#> Clu 17.59529 0.325777 2.156530 TRUE
#> Ins2 120.53970 0.364432 3.031129 TRUE
#> varm
#> <DataFrame>
#> Xist 3311.419:3522.173:4363.661:...
#> Cpe 623.416: 640.433: 486.004:...
#> Isl1 887.846: 408.402: 395.644:...
#> Nnat 280.220: 135.465: 121.017:...
#> Clu 727.547: 680.266: 641.781:...
#> Ins2 280.605: 244.654: 240.003:...Once the kinetic rate parameters are estimated, the actual velocities
are estimated based on these. This step adds a velocity layer to the
data object, and the velocity_genes column in the row data.
This column indicates whether the fit for a gene is considered ‘good
enough’ for downstream use. Specifically, it requires
fit_r2, fit_likelihood and
fit_gamma to exceed certain thresholds,
fit_scaling to be within a certain range, and that both the
unspliced and spliced mean values are nonzero (see the scVelo
code for more details).
## Recreate the determination of velocity genes based on the estimated values and
## check that it agrees with the velocity_genes column returned by scVelo.
rd <- rowData(velo.out)
perc <- quantile(rd$fit_scaling[!is.na(rd$fit_scaling)], probs = c(0.1, 0.9))
table(crit = rd$fit_r2 > 0.01 &
rd$fit_gamma > 0.01 &
rd$fit_likelihood > 0.001 &
rd$fit_scaling > min(perc[1], 0.03) &
rd$fit_scaling < max(perc[2], 3) &
rowSums(assay(velo.out, "Ms")) > 0 &
rowSums(assay(velo.out, "Mu")) > 0,
vel = rd$velocity_genes, useNA = "ifany")
#> vel
#> crit FALSE TRUE
#> FALSE 316 0
#> TRUE 0 537
#> <NA> 147 0At this point, we have estimated the velocities - these are vectors
in a \(K\)-dimensional space, where
\(K\) is the number of retained genes.
In order to use these velocities for downstream applications, such as
estimating the future state of an individual cell or generating
low-dimensional visualizations, we next estimate a so called
velocity graph. To this end, scVelo calculates cosine
similarities between the velocity vector for each cell and the
displacement vector from that cell to each other (neighboring) cell:
\[\pi_{ij}=cos\angle(\delta_{ij},
v_i),\]where \(\delta_{ij}=s_j-s_i\) is the displacement
vector from cell \(i\) to cell \(j\) in gene expression space, and \(v_i\) is the velocity vector of cell \(i\). The cosine similarity takes values
between -1 and +1, and a large value indicates that cell \(i\) has a high probability of transitioning
towards cell \(j\) (since its velocity
vector points towards cell \(j\)). The
velocity graph is stored in the uns slot of the data object
(and subsequently propagated to the metadata of the
returned SingleCellExperiment), and is represented by a
sparse \(N\times N\) matrix (where
\(N\) is the number of cells). There is
also a velocity_graph_neg, which is a matrix of the same
size as velocity_graph, containing the negative cosine
similarities.
## Velocity graph
dim(metadata(velo.out)$velocity_graph)
#> [1] 3696 3696
metadata(velo.out)$velocity_graph[1:5, 1:5]
#> 5 x 5 sparse Matrix of class "dgTMatrix"
#>
#> [1,] . . 0.2597513 . .
#> [2,] . . . . .
#> [3,] . . . . .
#> [4,] . . . . .
#> [5,] . . . . .If you look closely, you note that the number of non-zero elements in
each row of the velocity graph (after considering both positive and
negative cosine distances) is larger than the number of neighbors. The
reason for this is that with default settings, scVelo performs
a recursive neighbor finding (with two iterations). This can be
controlled by the n_recurse_neighbors arguments to the
velocity graph calculations.
It is often of interest to obtain an ordering of the cells along a trajectory. scVelo provides two different approaches for this: pseudotime and latent time. The velocity pseudotime is obtained via a diffusion random walk on the velocity graph, and measures how many steps (on average) it takes to reach a given cell from one of the root cells. The root cells are obtained from the transition matrix. The latent time is obtained from the transcriptional dynamics fit, by relating gene-specific times (position along the phase curve) to a “universal” latent time, shared across genes.
head(colData(velo.out))
#> DataFrame with 6 rows and 8 columns
#> velocity_self_transition root_cells end_points
#> <numeric> <numeric> <numeric>
#> AAACCTGAGAGGGATA 0.353678 1.31333e-08 1.08882e-06
#> AAACCTGAGCCTTGAT 0.556256 1.16838e-01 9.29647e-07
#> AAACCTGAGGCAATTA 0.603827 9.33038e-09 2.21098e-06
#> AAACCTGCATCATCCC 0.443273 7.14510e-01 3.93555e-07
#> AAACCTGGTAAGTGGC 0.181264 9.07078e-08 1.46003e-06
#> AAACCTGGTATTAGCC 0.515596 1.64557e-01 9.08715e-07
#> velocity_pseudotime latent_time velocity_length
#> <numeric> <numeric> <numeric>
#> AAACCTGAGAGGGATA 0.85800135 0.8851663 10.53
#> AAACCTGAGCCTTGAT 0.03519455 0.0919535 11.51
#> AAACCTGAGGCAATTA 0.92542219 0.9843534 8.85
#> AAACCTGCATCATCCC 0.00694828 0.0289450 9.77
#> AAACCTGGTAAGTGGC 0.63451403 0.6756038 11.81
#> AAACCTGGTATTAGCC 0.01391311 0.0914198 8.22
#> velocity_confidence velocity_confidence_transition
#> <numeric> <numeric>
#> AAACCTGAGAGGGATA 0.867448 0.3688718
#> AAACCTGAGCCTTGAT 0.891550 0.1025469
#> AAACCTGAGGCAATTA 0.805561 0.0212692
#> AAACCTGCATCATCCC 0.844067 0.3257531
#> AAACCTGGTAAGTGGC 0.928005 0.5371662
#> AAACCTGGTATTAGCC 0.802261 0.1929993Note that the results of scvelo() are not added to the
input SingleCellExperiment object, since the two objects
can have different dimensions (in particular note the difference in the
number of genes). Thus, in order to use information from the velocity
calculations together with information from the original object, we need
to copy slots from one object to the other.
Exercise: Copy the reduced dimension representations
and cell type annotations from the original
SingleCellExperiment object to the output of
scvelo(), and use the latter to construct UMAP
representations colored by various parameters (e.g., the cell types,
velocity pseudotime and latent time).
stopifnot(all(colnames(sce) == colnames(velo.out)))
reducedDims(velo.out) <- reducedDims(sce)
velo.out$clusters <- sce$clusters
cowplot::plot_grid(
scater::plotReducedDim(velo.out, dimred = "X_umap", colour_by = "clusters"),
scater::plotReducedDim(velo.out, dimred = "X_umap", colour_by = "velocity_pseudotime"),
scater::plotReducedDim(velo.out, dimred = "X_umap", colour_by = "latent_time"),
ncol = 1, align = "v"
)As we already saw above, the rowData and
colData of velo.out contain the values of
various parameters estimated by scvelo():
head(rowData(velo.out)[rowData(velo.out)$velocity_genes, ])
#> DataFrame with 6 rows and 18 columns
#> fit_r2 fit_alpha fit_beta fit_gamma fit_t_ fit_scaling fit_std_u
#> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> Xist 0.0816643 2.31238 26.163111 0.167319 11.70007 0.0194907 0.0314970
#> Cpe 0.7659820 3.28512 1.993591 0.189969 15.16518 0.1977422 0.7048391
#> Isl1 0.9152534 39.56635 44.674312 1.824373 12.12729 0.0162127 0.2135069
#> Nnat 0.9324625 28.09730 2.580149 0.357863 19.76561 0.0722484 2.2767632
#> Clu 0.4300578 8.67251 34.744106 0.374148 6.60626 0.0174170 0.0828731
#> Ins2 0.9949970 36.26005 0.130333 0.268981 15.37212 1.4386132 47.8505974
#> fit_std_s fit_likelihood fit_u0 fit_s0 fit_pval_steady fit_steady_u
#> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> Xist 2.79972 0.188480 0 0 0.495581 0.113795
#> Cpe 5.88132 0.601454 0 0 0.487579 1.561235
#> Isl1 6.36506 0.422870 0 0 0.498028 0.635709
#> Nnat 14.18085 0.360123 0 0 0.499801 11.243400
#> Clu 7.24431 0.366780 0 0 0.495890 0.200323
#> Ins2 17.37919 0.304236 0 0 0.499925 312.344015
#> fit_steady_s fit_variance fit_alignment_scaling velocity_genes
#> <numeric> <numeric> <numeric> <logical>
#> Xist 8.23593 1.423112 3.057727 TRUE
#> Cpe 11.83513 0.149404 4.185511 TRUE
#> Isl1 15.82331 0.299459 0.667317 TRUE
#> Nnat 46.33812 0.142184 2.414013 TRUE
#> Clu 17.59529 0.325777 2.156530 TRUE
#> Ins2 120.53970 0.364432 3.031129 TRUE
#> varm
#> <DataFrame>
#> Xist 3311.419:3522.173:4363.661:...
#> Cpe 623.416: 640.433: 486.004:...
#> Isl1 887.846: 408.402: 395.644:...
#> Nnat 280.220: 135.465: 121.017:...
#> Clu 727.547: 680.266: 641.781:...
#> Ins2 280.605: 244.654: 240.003:...
head(colData(velo.out))
#> DataFrame with 6 rows and 9 columns
#> velocity_self_transition root_cells end_points
#> <numeric> <numeric> <numeric>
#> AAACCTGAGAGGGATA 0.353678 1.31333e-08 1.08882e-06
#> AAACCTGAGCCTTGAT 0.556256 1.16838e-01 9.29647e-07
#> AAACCTGAGGCAATTA 0.603827 9.33038e-09 2.21098e-06
#> AAACCTGCATCATCCC 0.443273 7.14510e-01 3.93555e-07
#> AAACCTGGTAAGTGGC 0.181264 9.07078e-08 1.46003e-06
#> AAACCTGGTATTAGCC 0.515596 1.64557e-01 9.08715e-07
#> velocity_pseudotime latent_time velocity_length
#> <numeric> <numeric> <numeric>
#> AAACCTGAGAGGGATA 0.85800135 0.8851663 10.53
#> AAACCTGAGCCTTGAT 0.03519455 0.0919535 11.51
#> AAACCTGAGGCAATTA 0.92542219 0.9843534 8.85
#> AAACCTGCATCATCCC 0.00694828 0.0289450 9.77
#> AAACCTGGTAAGTGGC 0.63451403 0.6756038 11.81
#> AAACCTGGTATTAGCC 0.01391311 0.0914198 8.22
#> velocity_confidence velocity_confidence_transition
#> <numeric> <numeric>
#> AAACCTGAGAGGGATA 0.867448 0.3688718
#> AAACCTGAGCCTTGAT 0.891550 0.1025469
#> AAACCTGAGGCAATTA 0.805561 0.0212692
#> AAACCTGCATCATCCC 0.844067 0.3257531
#> AAACCTGGTAAGTGGC 0.928005 0.5371662
#> AAACCTGGTATTAGCC 0.802261 0.1929993
#> clusters
#> <factor>
#> AAACCTGAGAGGGATA Pre-endocrine
#> AAACCTGAGCCTTGAT Ductal
#> AAACCTGAGGCAATTA Alpha
#> AAACCTGCATCATCCC Ductal
#> AAACCTGGTAAGTGGC Ngn3 high EP
#> AAACCTGGTATTAGCC DuctalMore information about the interpretation of these parameters can be found in the scVelo documentation. Let’s, for example, plot the distribution of transcription, splicing and degradation rates for all velocity genes with fit likelihood larger than 0.1:
suppressPackageStartupMessages({
library(dplyr)
})
plotdf <- as.data.frame(rowData(velo.out)) %>%
dplyr::filter(fit_likelihood > 0.1 & velocity_genes)
dim(plotdf)
#> [1] 532 39
cowplot::plot_grid(
ggplot(plotdf, aes(x = fit_alpha)) + geom_histogram(bins = 50) +
theme_minimal() + scale_x_log10(),
ggplot(plotdf, aes(x = fit_beta)) + geom_histogram(bins = 50) +
theme_minimal() + scale_x_log10(),
ggplot(plotdf, aes(x = fit_gamma)) + geom_histogram(bins = 50) +
theme_minimal() + scale_x_log10(),
nrow = 1
)The velociraptor package also provides plotting functions to visualize the RNA velocity results. For example, we can embed the estimated velocities into a provided low-dimensional space. In order to visualize the velocities in a lower-dimensional embedding, we convert the cosine similarities in the velocity graph to transition probabilities of cell-to-cell transitions by applying an exponential kernel: \[\tilde{\pi}_{ij}=\frac{1}{z_i}exp(\frac{\pi_{ij}}{\sigma_i^2}).\] The \(z_i\) are normalization factors and \(\sigma_i\) is an adaptive kernel width parameter. These transition probabilities are used to project the velocities into a low-dimensional embedding. This is achieved by weighting the normalized displacement vectors from a cell to all other cells in the low-dimensional space by the transition probabilities for cell, and taking the resulting weighted average as the low-dimensional velocity vector. More precisely, if \[\tilde{\delta}_{ij}=\frac{\tilde{s}_j-\tilde{s}_i}{\|\tilde{s}_j-\tilde{s}_i\|}\]are the normalized displacement vectors in the low-dimensional embedding, the embedded velocity is estimated by \[\tilde{v}_i=\sum_{j\neq i}\left(\tilde{\pi}_{ij} - \frac{1}{n}\right)\tilde{\delta}_{ij}.\] The \(1/n\) term is included to make the projected velocity zero when the transition probabilities represent a uniform distribution.
suppressPackageStartupMessages({
library(ggplot2)
})
## PCA
embedded <- velociraptor::embedVelocity(reducedDim(velo.out, "X_pca")[, 1:2], velo.out)
#> ℹ Using the 'X' assay as the X matrix
grid.df <- velociraptor::gridVectors(reducedDim(velo.out, "X_pca")[, 1:2], embedded)
plotReducedDim(velo.out, dimred = "X_pca", colour_by = "velocity_pseudotime") +
geom_segment(data = grid.df,
mapping = aes(x = start.1, y = start.2,
xend = end.1, yend = end.2),
arrow = arrow(length = unit(0.05, "inches")))
## UMAP
embedded <- velociraptor::embedVelocity(reducedDim(velo.out, "X_umap"), velo.out)
#> ℹ Using the 'X' assay as the X matrix
grid.df <- velociraptor::gridVectors(reducedDim(velo.out, "X_umap"), embedded)
plotReducedDim(velo.out, dimred = "X_umap", colour_by = "velocity_pseudotime") +
geom_segment(data = grid.df,
mapping = aes(x = start.1, y = start.2,
xend = end.1, yend = end.2),
arrow = arrow(length = unit(0.05, "inches")))Next, we will look at individual genes via their phase plots. This is often useful in order to understand how the velocities are affected/supported by specific genes. We can of course plot genes that we are already familiar with and that we know are related to the process of interest. We can also extract genes with particularly strong influence on the velocity results. ‘Driver genes’, which display a strong dynamic behavior, can for example be detected via their high likelihoods in the dynamic model.
Exercise: Use the
velociraptor::plotVelocity() function to generate phase
plots and UMAP embedding plots colored by velocity and gene expression,
respectively, for the six genes with the largest fit likelihoods.
toplh <- rownames(as.data.frame(rowData(velo.out)) %>%
dplyr::arrange(desc(fit_likelihood)) %>%
head())
toplh
#> [1] "Pcsk2" "Ank" "Rps3" "Tmem163" "Dcdc2a" "Btbd17"
plotVelocity(velo.out, use.dimred = "X_umap", genes = toplh[1:6],
color_by = "clusters")Finally, we will illustrate two summary statistics returned by scVelo:
cowplot::plot_grid(
scater::plotReducedDim(velo.out, dimred = "X_umap", colour_by = "velocity_length"),
scater::plotReducedDim(velo.out, dimred = "X_umap", colour_by = "velocity_confidence"),
ncol = 1, align = "v"
)In the analysis above we used the most highly variable genes in the data set. However, the velocity analysis can be strongly dependent on the gene selection, especially if there are multiple dynamic processes at play in the data set.
Exercise: Rerun the velocity estimation using only genes found to be associated with the cell cycle, and plot the velocity embedding. For example, Macosko et al. (2015) provide a list of 668 mouse genes with expression patterns that varied along the cell cycle at a false discovery rate of 5% (Table S2 of https://www.cell.com/fulltext/S0092-8674(15)00549-8).
cell_cycle_genes <- read.delim("Macosko_TableS2_mouse.txt", header = FALSE)[, 1]
length(cell_cycle_genes)
#> [1] 668
length(intersect(cell_cycle_genes, rownames(sce)))
#> [1] 658
cell_cycle_genes <- intersect(cell_cycle_genes, rownames(sce))
velocc <- velociraptor::scvelo(sce, subset.row = cell_cycle_genes,
mode = "dynamical")
reducedDims(velocc) <- reducedDims(sce)
velocc$clusters <- sce$clusters
embedded <- velociraptor::embedVelocity(reducedDim(velocc, "X_umap"), velocc)
#> ℹ Using the 'X' assay as the X matrix
grid.df <- velociraptor::gridVectors(reducedDim(velocc, "X_umap"), embedded)
plotReducedDim(velocc, "X_umap", colour_by = "clusters") +
geom_segment(data = grid.df,
mapping = aes(x = start.1, y = start.2,
xend = end.1, yend = end.2),
arrow = arrow(length = unit(0.05, "inches")))Similarly, run the velocity analysis on the highly variable genes that are not also among the cell cycle-associated genes. What do you expect in this case?
velonocc <- velociraptor::scvelo(sce,
subset.row = setdiff(metadata(sce)$HVGs[1:1000], cell_cycle_genes),
mode = "dynamical")
reducedDims(velonocc) <- reducedDims(sce)
velonocc$clusters <- sce$clusters
embedded <- velociraptor::embedVelocity(reducedDim(velonocc, "X_umap"), velonocc)
#> ℹ Using the 'X' assay as the X matrix
grid.df <- velociraptor::gridVectors(reducedDim(velonocc, "X_umap"), embedded)
plotReducedDim(velonocc, "X_umap", colour_by = "clusters") +
geom_segment(data = grid.df,
mapping = aes(x = start.1, y = start.2,
xend = end.1, yend = end.2),
arrow = arrow(length = unit(0.05, "inches")))While velociraptor wraps the main velocity workflow from scVelo, the latter contains also additional functionality that may be useful. Most of these functions are, however, also accessible from within R, by explicitly calling them in a basilisk environment. Here, we will illustrate this by extracting cluster-wise “velocity genes”, which are genes that show differences in velocity between clusters. See the scVelo documentation for more details.
## Start basilisk process using the environment from velociraptor
velproc <- basilisk::basiliskStart(velociraptor:::velo.env)
## Call a function inside the basilisk process
gr <- basilisk::basiliskRun(proc = velproc, function(sce) {
scv <- reticulate::import("scvelo")
ad <- zellkonverter::SCE2AnnData(sce) ## Convert SCE to AnnData
scv$tl$rank_velocity_genes(ad, groupby = 'clusters') ## Apply function
df = scv$get_df(ad, 'rank_velocity_genes/names') ## Extract results
as.data.frame(df) ## Return pure R object
}, sce = velo.out)
#> ℹ Using the 'X' assay as the X matrix
## Close the process
basilisk::basiliskStop(velproc)
head(gr)
#> Ductal Ngn3 low EP Ngn3 high EP Pre-endocrine Beta Alpha Delta
#> 0 Krt8 Idh2 Vwa5b2 Abcc8 Nnat Rab27a Atp1a1
#> 1 St3gal4 Ndrg1 Tecr Scg5 Sphkap Map1b Rpl22l1
#> 2 Pld3 Kdm7a Samd14 Syt7 Slc31a2 Elavl4 Anxa5
#> 3 Rps3 Spint2 Id2 Pcsk2os1 Ppp1r1a Papss2 Cldn11
#> 4 Spint2 2010107G23Rik Atp2a3 Pcsk2 Pam Cpt2 Tmem27
#> 5 Errfi1 Hpgd Tmem258 Gstz1 Slc30a8 Rnaseh2c Mapt
#> Epsilon
#> 0 Prdx4
#> 1 Rab27a
#> 2 Syt13
#> 3 Ush1c
#> 4 Dock11
#> 5 Tmpo
## Genes with different velocity in Beta cells
plotVelocity(velo.out, use.dimred = "X_umap", genes = gr$Beta[1:3],
color_by = "clusters")## Genes with different velocity in Ductal cells
plotVelocity(velo.out, use.dimred = "X_umap", genes = gr$Ductal[1:3],
color_by = "clusters")